[ML] WordCloud with NLTK

2019-1-28 Mon 15:11


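Stemming: strips word endings by rule, so the result is not always a true stem.
Lemmatizing: maps the different forms of a word to its dictionary form.
- More accurate when the part of speech is specified.
Part-of-speech (POS) tagging
POS: classifies words by grammatical function, form, and meaning.
NLTK uses the Penn Treebank tagset.
- NNP: proper noun, singular
- VB: verb, base form
- VBP: verb, non-3rd person singular present
- TO: "to" (preposition or infinitive marker)
- NN: noun, singular or mass
- DT: determiner

cf. POS tagging: text pre-processing practice

For text analysis with scikit-learn, a token that appears with a different POS should be treated as a different token.
Approach
- convert each token to "token/POS"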

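4. Text class

plot: plots the frequency of the most common tokens
dispersion_plot: shows where in the text each word occurs
- e.g. where each character appears in a novel
concordance: displays the passages containing the word, as many as given by lines
similar: finds words used in contexts similar to the given word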

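5. FreqDist

FreqDist: class holding the frequency of each word used in a document
return: {'word': frequency}
N(): total number of words
freq("word"): relative frequency (probability) of the word
most_common: most frequent words
5.1 Usage 1)
Extract it with the Text class's vocab method
5.2 Usage 2)
Build a FreqDist object from words selected from the corpus
- e.g. from the Emma corpus (austen-emma.txt), extract only person names (NNP, proper nouns) and apply stop words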

6. wordcloud

Uses FreqDist
Visualizes words according to their frequency

Contents

1. Corpus

import nltk
nltk.download('book', quiet=True)

True

from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

raw = nltk.corpus.gutenberg.raw('bryant-stories.txt')
print(raw[:300])

[Stories to Tell to Children by Sara Cone Bryant 1918]


TWO LITTLE RIDDLES IN RHYME


     There's a garden that I ken,
     Full of little gentlemen;
     Little caps of blue they wear,
     And green ribbons, very fair.
           (Flax.)

     From house to house he goes,
     A me

2. Tokenizing

sentence unit

sent_tokenize: returns a list of sentences

from nltk.tokenize import sent_tokenize
sent_tokenize(raw[:300])

["[Stories to Tell to Children by Sara Cone Bryant 1918] \r\n\r\n\r\nTWO LITTLE RIDDLES IN RHYME\r\n\r\n\r\n     There's a garden that I ken,\r\n     Full of little gentlemen;\r\n     Little caps of blue they wear,\r\n     And green ribbons, very fair.",
 '(Flax.)',
 'From house to house he goes,\r\n     A me']

word unit

word_tokenize
= Treebank-style tokenizer (same output as TreebankWordTokenizer on the example below)

from nltk.tokenize import word_tokenize
word_tokenize("this's, a, test! ha.")

['this', "'s", ',', 'a', ',', 'test', '!', 'ha', '.']

from nltk.tokenize import TreebankWordTokenizer
tree = TreebankWordTokenizer()
tree.tokenize("this's, a, test! ha.")

['this', "'s", ',', 'a', ',', 'test', '!', 'ha', '.']

WordPunctTokenizer

from nltk.tokenize import WordPunctTokenizer
punct = WordPunctTokenizer()
punct.tokenize("this's, a, test! ha.")

['this', "'", 's', ',', 'a', ',', 'test', '!', 'ha', '.']
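
The split above is what a plain regular expression would produce. As a rough sketch, assuming WordPunctTokenizer is built on the pattern \w+|[^\w\s]+ (as its documentation describes), the same result can be reproduced with the standard re module. The helper name word_punct_tokenize is made up for this illustration.

```python
import re

# Pattern assumed to mirror WordPunctTokenizer: runs of word characters,
# or runs of characters that are neither word characters nor whitespace.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")


def word_punct_tokenize(text):
    """Split text into alphanumeric runs and punctuation runs."""
    return WORD_PUNCT.findall(text)


tokens = word_punct_tokenize("this's, a, test! ha.")
print(tokens)
```

Because the apostrophe is neither a word character nor whitespace, it becomes its own token, which is why "this's" is split into three pieces here but only two by the Treebank tokenizer.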

RegexpTokenizer

from nltk.tokenize import RegexpTokenizer
pattern = r"[\w]+"  # raw string, so \w is passed to the regex engine as-is
retokenize = RegexpTokenizer(pattern)
retokenize.tokenize(raw[50:100])

['918', 'TWO', 'LITTLE', 'RIDDLES', 'IN', 'RHYME', 'T']
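
RegexpTokenizer can also treat the pattern as a separator rather than as the token itself. The sketch below contrasts the two modes on the same sample string used earlier; the variable names are illustrative only.

```python
from nltk.tokenize import RegexpTokenizer

sample = "this's, a, test! ha."

# Default mode: the pattern describes the tokens themselves.
word_tokens = RegexpTokenizer(r"\w+").tokenize(sample)

# gaps=True: the pattern describes the separators between tokens.
space_tokens = RegexpTokenizer(r"\s+", gaps=True).tokenize(sample)

print(word_tokens)
print(space_tokens)
```

In the first mode punctuation is dropped entirely, while in the second it stays attached to the neighbouring word, so the choice depends on whether punctuation matters for the later analysis.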

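3. Morphological analysis

Stemming: strips word endings by rule, so the result is not always a true stem.
Lemmatizing: maps the different forms of a word to its dictionary form.
- More accurate when the part of speech is specified.
Part-of-speech (POS) tagging

words = retokenize.tokenize(raw[1300:2000])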

stemming

from nltk.stem import PorterStemmer
st = PorterStemmer()
[(w, st.stem(w)) for w in words][:15]

[('said', 'said'),
 ('a', 'a'),
 ('little', 'littl'),
 ('soft', 'soft'),
 ('cheery', 'cheeri'),
 ('voice', 'voic'),
 ('and', 'and'),
 ('I', 'I'),
 ('want', 'want'),
 ('to', 'to'),
 ('come', 'come'),
 ('in', 'in'),
 ('N', 'N'),
 ('no', 'no'),
 ('said', 'said')]

from nltk.stem import LancasterStemmer
st = LancasterStemmer()
[(w, st.stem(w)) for w in words][:15]

[('said', 'said'),
 ('a', 'a'),
 ('little', 'littl'),
 ('soft', 'soft'),
 ('cheery', 'cheery'),
 ('voice', 'voic'),
 ('and', 'and'),
 ('I', 'i'),
 ('want', 'want'),
 ('to', 'to'),
 ('come', 'com'),
 ('in', 'in'),
 ('N', 'n'),
 ('no', 'no'),
 ('said', 'said')]
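
Both stemmers apply suffix-stripping rules, which is why results such as 'littl' and 'voic' are not real words. What stemming buys in return is that related forms collapse to one key. A small sketch with PorterStemmer, using a word list chosen for illustration (it is not taken from the corpus):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Inflected and derived forms chosen for illustration (not from the corpus).
forms = ["connect", "connected", "connecting", "connection", "connections"]

# Map every form to its stem; related forms should share one stem.
stems = {w: stemmer.stem(w) for w in forms}
print(stems)
```

All five forms end up under the same stem, so a frequency count over stems treats them as one word.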

lemmatizing

from nltk.stem import WordNetLemmatizer
lm = WordNetLemmatizer()
[(w, lm.lemmatize(w)) for w in words][:15]

[('said', 'said'),
 ('a', 'a'),
 ('little', 'little'),
 ('soft', 'soft'),
 ('cheery', 'cheery'),
 ('voice', 'voice'),
 ('and', 'and'),
 ('I', 'I'),
 ('want', 'want'),
 ('to', 'to'),
 ('come', 'come'),
 ('in', 'in'),
 ('N', 'N'),
 ('no', 'no'),
 ('said', 'said')]
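
Every word above comes back unchanged because lemmatize treats its input as a noun unless told otherwise (the pos argument defaults to 'n'). With NLTK itself the part of speech is passed as the second argument, for example lm.lemmatize('said', pos='v'). The toy sketch below only illustrates why the part of speech matters: the lookup table is hand-written for this example and is not WordNet data, and toy_lemmatize is a hypothetical helper.

```python
# Hand-written (word, POS) -> lemma table, for illustration only.
TOY_LEMMAS = {
    ("said", "v"): "say",
    ("saw", "v"): "see",
    ("saw", "n"): "saw",
    ("better", "a"): "good",
}


def toy_lemmatize(word, pos="n"):
    """Return the dictionary form for (word, pos), or the word itself if unknown."""
    return TOY_LEMMAS.get((word, pos), word)


print(toy_lemmatize("said"))        # looked up as a noun: no entry, unchanged
print(toy_lemmatize("said", "v"))   # looked up as a verb
print(toy_lemmatize("saw", "n"), toy_lemmatize("saw", "v"))
```

The same surface form can have different lemmas depending on its part of speech, so a lemmatizer without POS information has to guess.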

pos tagging

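POS: classifies words by grammatical function, form, and meaning.
NLTK uses the Penn Treebank tagset.
- NNP: proper noun, singular
- VB: verb, base form
- VBP: verb, non-3rd person singular present
- TO: "to" (preposition or infinitive marker)
- NN: noun, singular or mass
- DT: determiner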

from nltk.tag import pos_tag
sentence = sent_tokenize(raw[203:400])[0]
sentence

'And green ribbons, very fair.'

word = word_tokenize(sentence)
word

['And', 'green', 'ribbons', ',', 'very', 'fair', '.']

pos_tag

tagged_list = pos_tag(word)
tagged_list

[('And', 'CC'),
 ('green', 'JJ'),
 ('ribbons', 'NNS'),
 (',', ','),
 ('very', 'RB'),
 ('fair', 'JJ'),
 ('.', '.')]

nltk.help.upenn_tagset('JJ')

JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...

filtering

cc_list = [t[0] for t in tagged_list if t[1] == "CC"]
cc_list

['And']

untag: returns the words without their tags

from nltk.tag import untag
untag(tagged_list)

['And', 'green', 'ribbons', ',', 'very', 'fair', '.']

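POS tagging: text pre-processing practice

For text analysis with scikit-learn, a token that appears with a different POS should be treated as a different token.
Approach
- convert each token to "token/POS"

def tokenizer(doc):
    # Tokenize and tag the document passed in, instead of reusing the global tagged_list
    return ["/".join(p) for p in pos_tag(word_tokenize(doc))]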

tokenizer(sentence)

['And/CC', 'green/JJ', 'ribbons/NNS', ',/,', 'very/RB', 'fair/JJ', './.']
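
To see what the "token/POS" form changes, the sketch below counts the same tokens with and without their tags. The tagged pairs are written by hand for illustration (they are not pos_tag output), and only the standard library is used.

```python
from collections import Counter

# Hand-tagged (token, POS) pairs, assumed for illustration.
tagged = [("I", "PRP"), ("saw", "VBD"), ("the", "DT"), ("saw", "NN"), (".", ".")]

# Counting bare tokens merges the verb and the noun "saw".
plain_counts = Counter(tok for tok, _ in tagged)

# Counting "token/POS" strings keeps them apart.
tagged_counts = Counter("/".join(pair) for pair in tagged)

print(plain_counts["saw"])
print(tagged_counts["saw/VBD"], tagged_counts["saw/NN"])
```

A vectorizer fed with the "token/POS" strings would therefore build separate features for the two uses of the word.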

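4. Text class

plot: plots the frequency of the most common tokens
dispersion_plot: shows where in the text each word occurs
- e.g. where each character appears in a novel
concordance: displays the passages containing the word, as many as given by lines
similar: finds words used in contexts similar to the given word

from nltk import Text
text = Text(retokenize.tokenize(raw))

plot: plots the frequency of the most common tokens

import matplotlib.pyplot as plt
text.plot(30)
plt.show()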

[Figure: frequency plot of the 30 most common tokens]

dispersion_plot: shows where in the text each word occurs
- e.g. where each character appears in a novel

raw = nltk.corpus.gutenberg.raw('austen-emma.txt')
text = Text(retokenize.tokenize(raw))

text.dispersion_plot(['Emma', 'Knightley', 'Frank', 'Jane', 'Robert'])
plt.show()

[Figure: dispersion plot of the selected names]
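
A dispersion plot is essentially a scatter of token positions, one row per word. The sketch below computes those positions in plain Python; word_offsets is a hypothetical helper and the token list is made up to stand in for a tokenized novel.

```python
def word_offsets(tokens, targets):
    """Map each target word to the list of token positions where it occurs."""
    offsets = {w: [] for w in targets}
    for i, tok in enumerate(tokens):
        if tok in offsets:
            offsets[tok].append(i)
    return offsets


# Tiny made-up token list standing in for a tokenized novel.
tokens = "Emma met Frank . Frank left . Emma stayed".split()
offsets = word_offsets(tokens, ["Emma", "Frank", "Jane"])
print(offsets)
```

Matching in this sketch is exact and case-sensitive, so a misspelled or differently cased name simply produces an empty row.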

concordance: displays the passages containing the word, as many as given by lines

text.concordance('Emma', lines=5)

Displaying 5 of 865 matches:
 Emma by Jane Austen 1816 VOLUME I CHAPTER
 Jane Austen 1816 VOLUME I CHAPTER I Emma Woodhouse handsome clever and rich w
f both daughters but particularly of Emma Between _them_ it was more the intim
nd friend very mutually attached and Emma doing just what she liked highly est
 by her own The real evils indeed of Emma s situation were the power of having

similar: finds words used in contexts similar to the given word

text.similar('Emma', 10)

she it he i harriet you her jane him that

5. FreqDist

FreqDist: class holding the frequency of each word used in a document
return: {'word': frequency}

Usage 1)

Extract it with the Text class's vocab method

fd = text.vocab()
type(fd)

nltk.probability.FreqDist

Usage 2)

Build a FreqDist object from words selected from the corpus
- e.g. from the Emma corpus (austen-emma.txt), extract only person names (NNP, proper nouns) and apply stop words

nltk.help.upenn_tagset('NNP')

NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...

emma_tokens = pos_tag(retokenize.tokenize(raw))
len(emma_tokens), emma_tokens[0]

(161983, ('Emma', 'NN'))

from nltk import FreqDist

stopwords = ['Mr.', 'Mrs.', 'Miss', 'Mr', 'Mrs', 'Dear']
names_list = [t[0] for t in emma_tokens if t[1] == "NNP" and t[0] not in stopwords]
fd_names = FreqDist(names_list)
fd_names

FreqDist({'Emma': 830, 'Harriet': 491, 'Weston': 439, 'Knightley': 389, 'Elton': 385, 'Woodhouse': 304, 'Jane': 299, 'Fairfax': 241, 'Churchill': 223, 'Frank': 208, ...})

N(): total number of words
freq("word"): relative frequency (probability) of the word

fd_names.N(), fd_names['Emma'], fd_names.freq('Emma')

(7863, 830, 0.10555767518758744)

most_common: most frequent words

fd_names.most_common(5)

[('Emma', 830),
 ('Harriet', 491),
 ('Weston', 439),
 ('Knightley', 389),
 ('Elton', 385)]
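
The same three calls can be checked by hand on a small list. The sketch below uses a made-up list of names rather than the Emma corpus; freq is simply the count divided by N(), which is how 830 / 7863 gives roughly 0.1056 above.

```python
from nltk import FreqDist

# Made-up name list, for illustration only.
names = ["Emma", "Harriet", "Emma", "Weston", "Emma", "Harriet"]
fd_demo = FreqDist(names)

total = fd_demo.N()              # total number of samples counted
count = fd_demo["Emma"]          # raw count for one word
share = fd_demo.freq("Emma")     # count divided by total
top2 = fd_demo.most_common(2)    # most frequent (word, count) pairs

print(total, count, share, top2)
```

FreqDist is a subclass of collections.Counter, so the usual dictionary-style access and Counter methods are available on it as well.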

6. wordcloud

Uses FreqDist
Visualizes words according to their frequency

import matplotlib.pyplot as plt
from wordcloud import WordCloud
wc = WordCloud(width=1000, height=600, background_color='white', random_state=0)
plt.imshow(wc.generate_from_frequencies(fd_names))
plt.axis('off')
plt.show()

Henry's blog
